Lexical Co-occurrence, Statistical Significance, and Word Association
Abstract
Lexical co-occurrence is an important cue for detecting word associations. We present a theoretical framework for discovering statistically significant lexical co-occurrences from a given corpus. In contrast with the prevalent practice of giving weight to unigram frequencies, we focus only on the documents containing both terms of a candidate bigram. We detect biases in the span distributions of associated words, while being agnostic to variations in global unigram frequencies. Our framework has the fidelity to distinguish different classes of lexical co-occurrences, based on the strengths of the document-level and corpus-level cues of co-occurrence in the data. We perform extensive experiments on benchmark data sets to study the performance of various co-occurrence measures currently known in the literature. We find that a relatively obscure measure called Ochiai and a newly introduced measure, CSA, capture the notion of lexical co-occurrence best, followed by LLR, Dice, and TTest, while another popular measure, PMI, surprisingly performs poorly in the context of lexical co-occurrence.
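The abstract names several measures without defining them. As a rough illustration only, the sketch below computes three of them (Dice, Ochiai, and PMI) from document counts, using their common textbook definitions. The function names, the choice of document-level counts, and the toy numbers are assumptions made for this example and are not taken from the paper; CSA, LLR, and TTest are omitted because the abstract gives no details about them.

```python
import math


def dice(f_xy, f_x, f_y):
    """Dice coefficient: 2 * f(x, y) / (f(x) + f(y))."""
    return 2.0 * f_xy / (f_x + f_y)


def ochiai(f_xy, f_x, f_y):
    """Ochiai coefficient: f(x, y) / sqrt(f(x) * f(y))."""
    return f_xy / math.sqrt(f_x * f_y)


def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information in bits: log2(p(x, y) / (p(x) * p(y))).

    With probabilities estimated as counts over n documents, this
    simplifies to log2(n * f(x, y) / (f(x) * f(y))).
    """
    if f_xy == 0:
        return float("-inf")  # the pair never co-occurs
    return math.log2((n * f_xy) / (f_x * f_y))


# Hypothetical counts in a toy corpus of n = 1000 documents:
# f_x, f_y = documents containing each word, f_xy = documents containing both.
n = 1000
frequent_pair = dict(f_xy=50, f_x=100, f_y=100)
rare_pair = dict(f_xy=1, f_x=2, f_y=2)

for name, counts in (("frequent", frequent_pair), ("rare", rare_pair)):
    print(
        name,
        round(dice(**counts), 3),
        round(ochiai(**counts), 3),
        round(pmi(n=n, **counts), 3),
    )
```

In this toy setting both pairs co-occur in half of the documents where either word appears, so Dice and Ochiai score them identically, whereas PMI assigns a much higher score to the rare pair. This sensitivity of PMI to low counts is a well-known property of the measure; whether it accounts for the result reported in the abstract is not something the abstract states.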
Similar resources
Choosing the Word Most Typical in Context Using a Lexical Co-Occurrence Network
This paper presents a partial solution to a component of the problem of lexical choice: choosing the synonym most typical, or expected, in context. We apply a new statistical approach to representing the context of a word through lexical co-occurrence networks. The implementation was trained and evaluated on a large corpus, and results show that the inclusion of second-order co-occurrence relat...
Measuring Syntagmatic Fixedness of Multi-Word Expressions
Syntagmatic fixedness is an important feature of multi-word expressions (MWE). However, syntagmatic fixedness is gradual and various semantic and syntactic relations hold among the parts of MWEs. This poses intriguing problems for lexicography, linguistic description and language processing. In this paper we propose a computationally inexpensive and intuitive approach to the measurement of synt...
Improving Pointwise Mutual Information (PMI) by Incorporating Significant Co-occurrence
We design a new co-occurrence-based word association measure by incorporating the concept of significant co-occurrence into the popular word association measure Pointwise Mutual Information (PMI). Through extensive experiments with a large number of publicly available datasets, we show that the newly introduced measure performs better than other co-occurrence-based measures and despite being resource-li...
The Computation of Word Associations: Comparing Syntagmatic and Paradigmatic Approaches
It is shown that basic language processes such as the production of free word associations and the generation of synonyms can be simulated using statistical models that analyze the distribution of words in large text corpora. According to the law of association by contiguity, the acquisition of word associations can be explained by Hebbian learning. The free word associations as produced by sub...
You shall know an object by the company it keeps: An investigation of semantic representations derived from object co-occurrence in visual scenes
An influential position in lexical semantics holds that semantic representations for words can be derived through analysis of patterns of lexical co-occurrence in large language corpora. Firth (1957) famously summarised this principle as "you shall know a word by the company it keeps". We explored whether the same principle could be applied to non-verbal patterns of object co-occurrence in natu...
Publication date: 2011